ABSTRACT

This paper reports on our analysis of the 2011 CAMRa Chal- 
lenge dataset (Track 2) for context-aware movie recommen- 
dation systems. The train dataset comprises 4 536 891 ra- 
tings provided by 171 670 users on 23 974 movies, as well as 
the household groupings of a subset of the users. The test 
dataset comprises 5 450 ratings for which the user label is 
missing, but the household label is provided. The challenge
requires identifying the user labels for the ratings in the test
set.

Our main finding is that temporal information (the time labels
of the ratings) is significantly more useful for achieving this
objective than user preferences (the actual ratings). Using a
model that leverages this fact, we are able to identify users
within a known household with an accuracy of approximately 96 %
(i.e. a misclassification rate of around 4 %).

Categories and Subject Descriptors 

G.3 [Probability and Statistics]: Correlation and regression
analysis; I.2.6 [Learning]: Parameter learning

General Terms 

Algorithms, Performance 

1. INTRODUCTION 

The incorporation of contextual information is likely to 
play an ever-increasing role in recommendation systems be- 
cause of the broad availability of such information, and the 
need for more accurate systems. Among sources of contex- 
tual information, the social structure of a given pool of users 
is particularly interesting in view of the potential conver- 
gence between online social networks and recommendation 
systems. 

In this paper we investigate the relation between social
structure and users' behavior within a recommendation system,
through the analysis of the CAMRa 2011 dataset (Track 2). Our
results are summarized in Table 1.

                          Any size   Size 2   Size 3   Size 4
Misclassification rate    0.0406     0.0413   0.0268   0.0463

Table 1: Best misclassification rates obtained for the
challenge data set (Track 2). We report the average
misclassification rate over all households, and the averages
over households of size 2, of size 3 and of size 4 respectively.



In the remainder of this section we describe the challenge 
data set, we explain the performance metrics used, we give 
an overview of the algorithms we propose and their corre- 
sponding results, and finally we give a short overview of 
related work. 

1.1 Description of the data set 

The training data consists of a collection of 4 536 891
ratings. Each entry (rating) takes the form

$$(i, j, M_{ij}, t_{ij}). \tag{1}$$

Here $i \in [m]$ (with $m = 171\,670$) is a user ID, $j \in [n]$ (with
$n = 23\,974$) is a movie ID, $M_{ij}$ (with $0 \le M_{ij} \le 100$) is the
rating provided by user $i$ on movie $j$, and $t_{ij}$ is the time-stamp
of that rating. (Throughout this paper we denote by
$[N] = \{1, \dots, N\}$ the set of the first $N$ integers.) We denote
by $E \subseteq [m] \times [n]$ the subset of user-movie pairs for which a
rating is available.

The training data also includes information about the
household structure of a subset of users. This is provided in
the form of 290 household-composition tuples

$$(H, i_1, \dots, i_L). \tag{2}$$

Here $H$ is a household ID, and $i_1, \dots, i_L$ are the IDs of users
belonging to household $H$. The number $L$ of users in the
same household varies between 2 and 4. We will write $i \in H$
to indicate that user $i$ belongs to household $H$. For instance,
given the above tuple, we know that $i_1, \dots, i_L \in H$.
The test data comprises 5 450 tuples of the form

$$(H, j, M_{Hj}, t_{Hj}), \tag{3}$$

whereby $H$ is a household ID, $j$ is a movie ID, $M_{Hj}$ is a
rating provided by one of the users in $H$ for movie $j$, and
$t_{Hj}$ is the corresponding time-stamp. Track 2 of the challenge
requires inferring the user $i \in H$ that actually provided each
of these ratings.

In the following, we denote by Train the train set, and by 
Test the test set. 



1.2 Performance metrics 

Of the 290 households, the vast majority, namely 272, 
is formed by 2 users, while 14 include 3 users, and only 4 
are formed by 4 users. As a consequence of this, a purely 
random inference algorithm achieves an average misclassifi- 
cation rate over all households that is slightly above 50 % 
(indeed, approximately 0.511). The same random inference 
algorithm achieves an average misclassification rate of 50 % 
over households of size 2, of 66 % over households of size 3 
and 75 % over households of size 4. This performance pro- 
vides a baseline for the algorithms developed in this paper. 
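The baseline figure quoted above can be checked with a few lines of arithmetic. The sketch below assumes a guesser that picks uniformly at random among the members of each household; the function and variable names are illustrative only.

```python
from fractions import Fraction

# Household sizes reported in this section: size -> number of households.
HOUSEHOLD_COUNTS = {2: 272, 3: 14, 4: 4}

def random_guess_error(size):
    """Expected error of a uniform random guess among `size` users."""
    return 1 - Fraction(1, size)

def average_baseline(counts):
    """Average the per-household expected error over all households."""
    total = sum(counts.values())
    weighted = sum(n * random_guess_error(s) for s, n in counts.items())
    return weighted / total

baseline = average_baseline(HOUSEHOLD_COUNTS)
print(float(baseline))  # slightly above 0.5
```

The average is taken per household, not per rating, which matches the way the misclassification rate is averaged in the rest of the paper.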

As a performance metric we will use standard ROC variables
(true positive rate and one minus false positive rate).
More precisely, given a household with two users $i = 1$ and
$i = 2$, we let $T_1$ and $T_2$ be the total number of entries in
Test that correspond to user 1 and user 2 respectively, while
$TP_1(\text{Alg})$ and $TP_2(\text{Alg})$ are the number of those entries
assigned by algorithm Alg to 1 and 2. Then the corresponding
true positive rates are

$$TPR_1(\text{Alg}) = \frac{TP_1(\text{Alg})}{T_1}, \qquad TPR_2(\text{Alg}) = \frac{TP_2(\text{Alg})}{T_2}. \tag{4}$$

Notice that $TPR_2(\text{Alg})$ is equal to one minus the false positive
rate in predicting 1, so these are the usual ROC variables.
This definition is generalized in the obvious way in
the case of 3- and 4-user households.

The total misclassification rate per household $H$ is defined
as follows in terms of the above quantities (always considering
2-user households, but easily generalized):

$$P(\text{Alg}, H) = 1 - \frac{TP_1(\text{Alg}) + TP_2(\text{Alg})}{T_1 + T_2}. \tag{5}$$

We define P to be the average of P(Alg, H) over all house- 
holds. We also compute the average of P(Alg, H) over house- 
holds of size 2 only, of size 3 only and size 4 only. We denote 
these values by P2, P3 and P4 respectively. 
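As an illustration of definitions (4) and (5), the following sketch computes the per-user true positive rates and the household misclassification rate from a list of true and predicted user labels. The helper names and the toy labels are hypothetical; the averaging over households mirrors the definition of P given above.

```python
def household_metrics(true_users, predicted_users):
    """Return ({user: TPR}, misclassification rate) for one household."""
    totals, hits = {}, {}
    for truth, guess in zip(true_users, predicted_users):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == guess:
            hits[truth] = hits.get(truth, 0) + 1
    tpr = {user: hits.get(user, 0) / totals[user] for user in totals}
    error = 1 - sum(hits.values()) / sum(totals.values())
    return tpr, error

def average_error(households):
    """Average the per-household error; `households` holds (truth, guess) pairs."""
    errors = [household_metrics(t, g)[1] for t, g in households]
    return sum(errors) / len(errors)

# Toy household with two users: one rating of user 1 is misattributed.
truth = [1, 1, 1, 2, 2]
guess = [1, 1, 2, 2, 2]
tpr, error = household_metrics(truth, guess)
print(tpr, error)
```

Restricting `households` to those of a given size gives the analogues of P2, P3 and P4.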

In order to obtain a 2-dimensional ROC curve, we will
plot the true positive rate for, say, user 1 against the true
positive rate for the union of users 2 and 3.

1.3 Overview of algorithms and results 

We will consider three classes of methods that incorporate 
increasing amounts of contextual information: 

1. Low-rank approximation, cf. Section 2, provides an effective
tool to embed the collection of movies and users at hand
within a low-dimensional latent space $\mathbb{R}^r$, $r \ll m, n$.
A high rating provided by user $i$ on movie $j$ corresponds to
latent space vectors with large inner product. We use the
latent vectors associated with users within the same household
to infer which user rated a certain movie, by selecting
the latent vector whose inner product with the movie vector
best reproduces the observed rating. Generalizing [11], we
extend these models to include temporal variability, in both
users' and movies' latent vectors. If our temporal units are
the 12 months of the year, the resulting model achieves an
overall misclassification rate $P \approx 0.3735$.

2. The second group of methods, cf. Section 3, makes
crucial use of temporal patterns in the users' rating behavior.
Indeed, our single most striking discovery is that different
users within the same household exhibit very well separated
viewing habits. These habits are clearly demonstrated by
comparing the distribution of ratings across the days of the
week for two users in the same household. For a large number
of households, these distributions have almost disjoint
support. A simple algorithm that uses only the day of
the week to infer the user identity achieves a misclassification
rate $P \approx 0.1154$. We also discuss a generative model
which incorporates both ratings (through low-rank approximation)
and temporal patterns, achieving $P \approx 0.0950$.
3. Section 4 proposes a unified framework based on binary
classification to exploit latent space information as well as
temporal information, and additional contextual information.
The binary classification 'module' we use is regularized
logistic regression, but could be replaced by a number of
equivalent methods. By using composite feature vectors
including several types of information, we achieve $P \approx 0.0406$.

1.4 Related work 

Several aspects of our investigation confirm claims of earlier
work, such as the usefulness of low-rank approximation
[3, 12] and the importance of accounting for temporal evolution
[12, 5]. At the same time, the present dataset allows
us to provide striking evidence of these two points. Furthermore,
the precise form of temporal patterns and their
extraction in the form of weekly and daily habits is novel
and extremely powerful.

The importance of the time of day as context for recommendations
has been noted in the past, e.g., in recommending
music tracks [1, 2]. Our most striking finding is that, in
the challenge dataset, users within a given household tend 
to view and rate movies at different times of the day and dif- 
ferent days of the week. Thus, time is an important factor 
not only in recommendations but also in user identification. 

2. LOW-RANK APPROXIMATION 

This section consists of three parts, dealing respectively 
with rating prediction from a training set, rating classifica- 
tion in a test set, and evaluation of the misclassification rate 
on the challenge data set. We first propose two collabora- 
tive filtering methods, based on low-rank matrix completion, 
to predict the missing ratings in a training set. The first 
method relies only on the ratings provided in the training 
set to predict the missing ratings. The second method also 
factors in the context by taking into account the temporal 
information in the training set. We then turn our attention 
to the test set, containing household ratings, and use the 
aforementioned prediction models to identify which user in 
a household provided a given rating in the test set. Finally, 
we evaluate our methods on the challenge dataset, and pro- 
vide empirical results in terms of misclassification rate and 
ROC curve. 

Throughout this section, we denote by $x \sim U[a, b]$ a random
variable $x$ uniformly distributed in $[a, b]$. For $x, y \in \mathbb{R}^n$,
$\langle x, y \rangle = x^T y = \sum_{i=1}^n x_i y_i$ denotes the usual inner product,
and $\|x\|^2 = \langle x, x \rangle$. For $M \in \mathbb{R}^{m \times n}$, $\|M\|_F$ is its Frobenius
norm. We let $\mathbf{1}_n = [1, \dots, 1]^T$, and $I_n$ be the identity matrix
of size $n$.

2.1 Simple low-rank approximation 

2.1.1 Model 

A simple low rank model is obtained by approximating
the matrix of ratings $M \in \mathbb{R}^{m \times n}$ by a low-rank matrix
$\hat{M} = U V^T + Z \mathbf{1}_n^T$, where matrix $U = [u_1 | \cdots | u_m]^T$ is of size $m \times r$,
matrix $V = [v_1 | \cdots | v_n]^T$ is of size $n \times r$, and the column
vector $Z = [z_1, \dots, z_m]^T$ is of length $m$. Each vector $u_i \in \mathbb{R}^r$


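Algorithm 1 Low rank approximation
procedure Initialization
    $\forall (i, l) \in [m] \times [r]$, $u_{il}^{(0)} \sim U[\,\cdot\,, \cdot\,]$
    $\forall (l, j) \in [r] \times [n]$, $v_{lj}^{(0)} \sim U[\,\cdot\,, \cdot\,]$
    $\forall i \in [m]$, $z_i^{(0)} = 50$
procedure Iterations($K$)
    for $k = 1 \dots K$ do
        for $i = 1 \dots m$ do
            $u_i^{(k)} = g\big(V_{E_i}^{(k-1)},\ M_{iE_i}^T - z_i^{(k-1)} \mathbf{1}_{|E_i|},\ \lambda\big)$
        for $j = 1 \dots n$ do
            $v_j^{(k)} = g\big((U_{F_j}^{(k)})^T,\ M_{F_j j} - Z_{F_j}^{(k-1)},\ \lambda\big)$
        for $i = 1 \dots m$ do
            update $z_i^{(k)}$ by minimizing (6) over $z_i$, with $U^{(k)}$ and $V^{(k)}$ fixed
    Return $(U^{(K)}, V^{(K)}, Z^{(K)})$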



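is associated with a user $i \in [m]$, and each vector $v_j \in \mathbb{R}^r$
corresponds to a movie $j \in [n]$. The column vector $Z$ models
the rating bias of each user. Matrices $U$, $V$ and $Z$ are found
by minimizing the following regularized empirical $\ell_2$ loss

$$C(U, V, Z) = \frac{1}{2} \sum_{(i,j) \in E} \big(M_{ij} - \langle u_i, v_j \rangle - z_i\big)^2 + \frac{\lambda}{2} \|U\|_F^2 + \frac{\lambda}{2} \|V\|_F^2. \tag{6}$$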

2.1.2 Alternate minimization 

The cost function (6) is non-convex, but several iterative
minimization methods have been developed with excellent
performance in practical settings [15, 14, 7, 13, 16]. Performance
guarantees for algorithms of this family were proved
in [HE], under suitable assumptions on the matrix M. Al- 
ternative approaches based on convex relaxations have been 
studied in [IE]. 

In this paper we adopt a simple alternate minimization
algorithm (see e.g. [11], [7] for very similar algorithms). Each
iteration of the algorithm consists of three steps: in the first
step, $V$ and $Z$ are fixed, and $U$ is updated by minimizing
(6); then $U$ and $Z$ are fixed, and $V$ is updated; finally, $U$ and
$V$ are fixed and $Z$ is updated. Pseudocode for the algorithm
is presented in Algorithm 1. The algorithm stops after $K$
iterations, and returns the triplet $(U, V, Z)$.

Since the cost (6) is separately quadratic in each of $U$, $V$
and $Z$, each of the steps can be performed by matrix inversion.
In fact, the problem presents a convenient separable
structure. For instance, the problem of minimizing over $U$
is separable in $u_1, u_2, \dots, u_m$. Minimizing $C(U, V, Z)$ over
a vector $u_i$ is equivalent to a ridge regression in $u_i$, whose
exact solution is given by

$$u_i = \big(V_{E_i} V_{E_i}^T + \lambda I_r\big)^{-1} V_{E_i} \big(M_{iE_i} - z_i \mathbf{1}_{|E_i|}^T\big)^T, \tag{7}$$

where $E_i = \{j \in [n] \mid (i,j) \in E\}$, $M_{iE_i} = [M_{ij}]_{j \in E_i} \in \mathbb{R}^{1 \times |E_i|}$,
and $V_{E_i} = [v_j]_{j \in E_i} \in \mathbb{R}^{r \times |E_i|}$. In order to concisely
represent this basic update, we define the function
$g$ as follows. Given a matrix $A \in \mathbb{R}^{r \times n}$, a column vector
$x \in \mathbb{R}^n$, and a real number $\alpha \in \mathbb{R}$, we let
$g(A, x, \alpha) = (A A^T + \alpha I_r)^{-1} A x$. The above update then reads
$u_i = g(V_{E_i}, M_{iE_i}^T - z_i \mathbf{1}_{|E_i|}, \lambda)$. Define $F_j = \{i \in [m] \mid (i,j) \in E\}$.
Proceeding analogously for the minimization over $V$ and $Z$,
we obtain Algorithm 1.
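The ridge update for a single user vector can be sketched in a few lines of NumPy. This is only an illustration of the function g and of the update (7) on synthetic data, not the authors' implementation; the array layout (rows of `V` are movie vectors) and the helper name `update_user` are assumptions made for the example.

```python
import numpy as np

def g(A, x, alpha):
    """Ridge solution (A A^T + alpha I_r)^{-1} A x for A of shape (r, n)."""
    r = A.shape[0]
    return np.linalg.solve(A @ A.T + alpha * np.eye(r), A @ x)

def update_user(V, z_i, rated_movies, ratings, lam):
    """Update u_i with V and z_i fixed.

    V: array of shape (n, r), one row per movie (assumed layout).
    rated_movies: indices of the movies rated by user i (the set E_i).
    ratings: the observed ratings M_{i E_i}, in the same order.
    """
    V_Ei = V[rated_movies].T          # shape (r, |E_i|)
    return g(V_Ei, ratings - z_i, lam)

# Synthetic check: noiseless ratings generated from a known user vector.
rng = np.random.default_rng(0)
V = rng.normal(size=(6, 2))
u_true = np.array([1.5, -0.5])
E_i = np.array([0, 2, 3, 5])
ratings = 50.0 + V[E_i] @ u_true
u_hat = update_user(V, 50.0, E_i, ratings, lam=1e-8)
print(u_hat)  # close to u_true when the regularization is negligible
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is usually preferable numerically for the small r x r systems involved.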



2.2 Low rank approximation with 
time-dependent factors 

In this section, we extend the previous low-rank prediction 
model to account for temporal information. 

2.2.1 Model 

In this model, we bin time into $T$ bins of equal duration,
indexed by $b \in \{1, \dots, T\}$. Given that user $i$ rates movie $j$
at time $t_{ij}$, we denote by $b(t_{ij}) \in [T]$ the unique bin index
for the observed rating of the pair $(i, j)$.

Let $M \in \mathbb{R}^{m \times n \times T}$ be the three-dimensional rating tensor
whose entry $M_{ij}(b)$ represents the rating that user $i \in [m]$
would give to movie $j \in [n]$ at a time in bin $b \in [T]$. The
matrix $M(b) \in \mathbb{R}^{m \times n}$ represents the rating matrix in bin $b$.
From a training set of observed ratings $\{M_{ij}(b(t_{ij})) \mid (i, j) \in E\}$,
we predict the missing ratings by approximating each matrix
$M(b)$, $b \in [T]$, by a low rank matrix $\hat{M}(b) = U(b) V(b)^T + Z(b) \mathbf{1}_n^T$.
This is a natural extension of the model in Section 2.1.
Matrices $U(b) \in \mathbb{R}^{m \times r}$, $V(b) \in \mathbb{R}^{n \times r}$ and $Z(b) \in \mathbb{R}^{m \times 1}$
are stacked in the tensors $U \in \mathbb{R}^{m \times r \times T}$, $V \in \mathbb{R}^{n \times r \times T}$
and $Z \in \mathbb{R}^{m \times 1 \times T}$ respectively. We obtain the tensors $(U, V, Z)$
by minimizing the following regularized $\ell_2$ loss

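$$C(U, V, Z) = R_{\lambda, \xi_u}(U) + R_{\lambda, \xi_v}(V) + R_{0, \xi_z}(Z) + \frac{1}{2} \sum_{(i,j) \in E} \Big(M_{ij}(b(t_{ij})) - \big\langle u_i(b(t_{ij})), v_j(b(t_{ij})) \big\rangle - z_i(b(t_{ij}))\Big)^2, \tag{8}$$

where the regularization terms are of the form

$$R_{\lambda, \xi}(U) = \frac{\lambda}{2} \sum_{b=1}^{T} \|U(b)\|_F^2 + \frac{\xi}{2} \sum_{b=1}^{T-1} \|U(b+1) - U(b)\|_F^2. \tag{9}$$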

Each regularization function consists of two terms: the first
term is an $\ell_2$ regularization for shrinkage, while the second
term promotes smooth time-variation. Note that by setting
the number of bins to $T = 1$, this model reduces to the
time-independent model described in Section 2.1. The same
happens by letting $\xi_u, \xi_v, \xi_z \to \infty$, which forces the factors
to be constant across bins.

2.2.2 Alternate minimization 

In order to minimize the cost function (8), we generalize
the alternate minimization algorithm of Section 2.1.2.
Namely we cycle over the time bin index $b$ and, for each $b$,
we sequentially minimize over $U(b)$, $V(b)$ and $Z(b)$, while
keeping $U(b')$, $V(b')$ and $Z(b')$, $b' \ne b$, fixed. As before,
each of these three minimization problems is quadratic and
hence solvable efficiently. Further, each of these quadratic
problems is separable across user indices (for minimization
over $U$ and $Z$) or movie indices (for minimization over $V$).
On the other hand, it is not separable across time bins because
of the second term in the regularization function, cf.
Eq. (9). As a consequence, the update steps change somewhat.
Consider, to be definite, the minimization over $U$.
A straightforward calculation yields the following expression
for the minimum over $u_i(b)$, when all other variables are kept
constant

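$$u_i(b) = \Big(V_{E_i(b)} V_{E_i(b)}^T + (\lambda + 2\xi_u) I_r\Big)^{-1} \Big(V_{E_i(b)} \big(M_{iE_i(b)} - z_i(b) \mathbf{1}_{|E_i(b)|}^T\big)^T + \xi_u \big(u_i(b+1) + u_i(b-1)\big)\Big),$$

where $E_i(b) = \{j \in E_i \mid b(t_{ij}) = b\}$ denotes the movies rated by user $i$ in bin $b$,
and where we assumed $b \in \{2, \dots, T-1\}$ (the boundary cases
$b = 1, T$ yield slightly different expressions). Defining
$h(A, x, y, \alpha, \beta) = (A A^T + \alpha I_r)^{-1} (A x + \beta y)$, the above can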



Algorithm 2 Time-dependent low rank approximation
procedure Initialization
    $\forall (i, l, b) \in [m] \times [r] \times [T]$, $u_{il}(b)^{(0)} \sim U[\,\cdot\,, \cdot\,]$
    $\forall (l, j, b) \in [r] \times [n] \times [T]$, $v_{lj}(b)^{(0)} \sim U[\,\cdot\,, \cdot\,]$
    $\forall (i, b) \in [m] \times [T]$, $z_i(b)^{(0)} = 50$
procedure Iterations($K$, $T$)
    for $k = 1 \dots K$ do
        for $b = 1 \dots T$ do
            for $i = 1 \dots m$ do
                $u_i(b)^{(k)} = h\big(V_{E_i(b)}^{(k-1)},\ M_{iE_i(b)}^T - \mathbf{1}_{|E_i(b)|} z_i(b)^{(k-1)},\ u_i(b+1)^{(k-1)} + u_i(b-1)^{(k)},\ \lambda + 2\xi_u,\ \xi_u\big)$
            for $j = 1 \dots n$ do
                $v_j(b)^{(k)} = h\big((U_{F_j(b)}^{(k)})^T,\ M_{F_j(b) j} - z_{F_j(b)}(b)^{(k-1)},\ v_j(b+1)^{(k-1)} + v_j(b-1)^{(k)},\ \lambda + 2\xi_v,\ \xi_v\big)$
            for $i = 1 \dots m$ do
                $z_i(b)^{(k)} = h\big(\mathbf{1}_{|E_i(b)|}^T,\ M_{iE_i(b)}^T - (V_{E_i(b)}^{(k)})^T u_i(b)^{(k)},\ z_i(b+1)^{(k-1)} + z_i(b-1)^{(k)},\ 2\xi_z,\ \xi_z\big)$
    Return $(U^{(K)}, V^{(K)}, Z^{(K)})$



be written as $u_i(b) = h\big(V_{E_i(b)},\ M_{iE_i(b)}^T - \mathbf{1}_{|E_i(b)|} z_i(b),\ u_i(b+1) + u_i(b-1),\ \lambda + 2\xi_u,\ \xi_u\big)$.

Analogous expressions hold for minimization over $z_i(b)$
and $v_j(b)$. A complete pseudocode is provided in Algorithm 2.
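To make the role of the smoothing term concrete, here is a small NumPy sketch of the function h and of the update of a user vector in an interior bin. It is an illustration under assumed array shapes, not the authors' code; `update_user_bin` is a hypothetical helper name. The example uses a bin in which the user has no ratings, where the update reduces to a weighted average of the neighbouring bins.

```python
import numpy as np

def h(A, x, y, alpha, beta):
    """(A A^T + alpha I_r)^{-1} (A x + beta y) for A of shape (r, n)."""
    r = A.shape[0]
    return np.linalg.solve(A @ A.T + alpha * np.eye(r), A @ x + beta * y)

def update_user_bin(V_b, residual, u_prev, u_next, lam, xi_u):
    """Update u_i(b) for an interior bin b.

    V_b: shape (r, |E_i(b)|), vectors of the movies rated by i in bin b.
    residual: ratings in bin b minus the bias z_i(b), length |E_i(b)|.
    u_prev, u_next: current values of u_i(b-1) and u_i(b+1).
    """
    return h(V_b, residual, u_next + u_prev, lam + 2 * xi_u, xi_u)

# A bin with no ratings: only the regularization terms remain.
empty_V = np.zeros((2, 0))
empty_residual = np.zeros(0)
u = update_user_bin(empty_V, empty_residual,
                    u_prev=np.array([1.0, 0.0]),
                    u_next=np.array([3.0, 2.0]),
                    lam=0.0, xi_u=10.0)
print(u)  # the midpoint of the two neighbouring vectors when lam = 0
```

With a positive `lam`, the same empty-bin update shrinks the midpoint towards zero, which is the shrinkage effect of the first term in (9).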

2.3 Household rating classification and results 

For each entry in the test set, the goal is to identify
which user in the household provided the rating. In this
section, our approach uses the rating and the corresponding
time-stamp provided within the test set, and the low rank
model obtained from the training set. Given a rating $M_{Hj}$
within household $H = \{i_1, \dots, i_L\}$, the simplest idea is to
attribute the rating to the user $i \in H$ for which the predicted
rating is closest to $M_{Hj}$. In other words, we return
$\arg\min_{i \in H} |M_{Hj} - \hat{M}_{ij}(b(t_{Hj}))|$.

In order to explore the tradeoff between precision and accuracy
through an ROC curve, we slightly generalize this
rule by introducing a parameter $\alpha > 0$, and proceed as follows.

(a) First, for each user $i \in H$, we compute the difference
$|M_{Hj} - \hat{M}_{ij}(b(t_{Hj}))|$.

(b) Consider the first user $i_1 \in H$. If

$$\alpha\, |M_{Hj} - \hat{M}_{i_1 j}(b(t_{Hj}))| < \min_{i \in H \setminus \{i_1\}} |M_{Hj} - \hat{M}_{ij}(b(t_{Hj}))|,$$

we conclude that user $i_1$ provided the household rating
$M_{Hj}$. Otherwise, we conclude it was some other user
in the household.
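The two decision rules of this section can be written down directly. The sketch below assumes that the predicted ratings of all household members for the movie and bin in question are already available as a list, with the first entry belonging to user i_1, and that the minimum in step (b) runs over the other household members; function names and the toy numbers are illustrative.

```python
def nearest_prediction(observed, predictions):
    """Index of the user whose predicted rating is closest to the observed one."""
    errors = [abs(observed - p) for p in predictions]
    return errors.index(min(errors))

def attribute_to_first_user(observed, predictions, alpha=1.0):
    """Step (b): True if the rating is attributed to the first user.

    predictions[0] is the prediction for user i_1; the remaining entries
    are the predictions for the other household members.
    """
    errors = [abs(observed - p) for p in predictions]
    return alpha * errors[0] < min(errors[1:])

observed = 70
predictions = [60, 75]   # user i_1 predicted 60, the other user 75
print(nearest_prediction(observed, predictions))
print(attribute_to_first_user(observed, predictions, alpha=1.0))
print(attribute_to_first_user(observed, predictions, alpha=0.4))
```

Sweeping `alpha` from small to large values moves the rule from always favouring the first user to almost never choosing that user, which is what traces out the ROC curve.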

2.3.1 Parameter selection and results

We will limit ourselves to discussing the results obtained 
with time-dependent factorization, since this method leads 
to more accurate predictions, and it subsumes the time- 
independent approach as a special case. 

We evaluated the accuracy through cross-validation for
several choices of the regularization parameters. Figure 1
shows the average misclassification rate versus the number
of iterations for various values of parameters. The misclassification
rate is close to 37%, and seems to become stable
after about 50 iterations. We thus fixed $K = 50$, and selected
the following values of parameters by minimizing the
misclassification rate: number of bins $T = 12$; rank $r = 10$;
regularization parameters $\lambda = 1$, $\xi_u = 10$, $\xi_v = \xi_z = 40$. Let




Figure 1: Average misclassification rate vs. number
of iterations K (horizontal axis, from 10 to 50), for different
values of parameters. Legend: ($T=12$, $r=10$, $\lambda=1$, $\xi_u=10$, $\xi_v=\xi_z=40$);
($T=1$, $r=10$, $\lambda=1$); ($T=12$, $r=5$, $\lambda=1$, $\xi_u=10$, $\xi_v=\xi_z=40$);
($T=5$, $r=10$, $\lambda=1$, $\xi_u=10$, $\xi_v=\xi_z=40$); ($T=5$, $r=10$, $\xi_u=40$, $\xi_v=\xi_z=40$).



us emphasize that we did not perform an exhaustive search 
over all sets of possible values, which could lead to further 
improvements. 

The results in Figure 1 were obtained by random-subsampling
cross-validation. We averaged over 5 different
splits of the dataset into training set and test set. In each
split, the test set was selected by randomly hiding approximately
4% of the data of each household. The curves obtained
with the original training and test sets provided in
the challenge are close to the ones in Figure 1. Our
cross-validation procedure is more reliable from a statistical point
of view. We will keep to this procedure for the rest of the
paper and only mention any discrepancies with respect
to the original split into test and training sets provided in the
challenge.

Figure 2 shows the ROC curve achieved by the present
classification method, for varying $\alpha$. Each point of the curve
corresponds to the average of the pair $(TPR_1(\alpha), TPR_2(\alpha))$
over all households in a (Train, Test) pair, itself averaged over
all (Train, Test) pairs (splits). Bars show the standard deviation
from the mean over different (Train, Test) splits.

Figure 2: TPR of user 1 in each household vs. TPR
of any other user.

3. TEMPORAL SIGNATURES 

Although our matrix factorization model captures the evolution
of user and movie profiles throughout the 12-month
period of the dataset, it does not make direct use of the rating
time-stamp in order to classify ratings within a household.
The time-stamp is only used indirectly, namely to
compute the predicted ratings $\hat{M}_{ij}$.

On the other hand, temporal behavior, especially weekly
behavior, appears to be extremely useful in distinguishing
users within the same household. Household members
exhibit distinct temporal patterns in their viewing habits.
Rather than viewing movies together, in many households
users consistently rate movies on different days of the week.

As a result, the day of the week on which a movie is 
rated provides a surprisingly good predictor of the user who 
watched it. We exploit this finding below, and propose a 
generative model that incorporates the day of the week as 
well as the movie rating. 

3.1 Temporal patterns in user behavior 

Clear temporal patterns emerge when considering the day 
of the week on which ratings are given. Most importantly, 
the temporal patterns in the viewing behavior of members 
of the same household turn out to be very well separated. 

As an illustration, Figure 3 shows the frequencies with
which users view movies on different days of the week for
four households (labeled 1, 200, 203, and 266 in the training
set). We see that, in households 1, 203, and 266, household
members tend to view and rate movies on very distinct
days of the week. For example, in household 1, one user
watches movies mostly on Sunday and Saturday, while the
other watches movies in the middle of the week.

This phenomenon is repeated in most of the households
in the training set. In order to quantify our observation,
let $P_i(d)$ denote the empirical probability distribution of
rating events associated with user $i \in [m]$ over different
days $d \in W = \{\text{Sun}, \text{Mon}, \dots, \text{Sat}\}$ (normalized so that
$\sum_{d \in W} P_i(d) = 1$). We define the average total variation
of a household $H$ as

Figure 3: Histograms of rating events across days of
the week (day 1 is Sunday) for four households. The
first three households have two members, while the
fourth has three. For each day of the week, we plot
$|H|$ histograms in different colors, each indicating the
number of viewing events of a household member.

$$S_H = \binom{|H|}{2}^{-1} \sum_{\{i, i'\} \subseteq H,\ i \ne i'} \|P_i - P_{i'}\|_{TV},$$



where we recall that $\|p - q\|_{TV} = \frac{1}{2} \sum_{d \in W} |p(d) - q(d)|$.
By definition $S_H \in [0, 1]$, with $S_H = 1$ corresponding to a
household in which no two users both rated a movie on the
same day of the week (possibly in different weeks).

Figure 4 shows the empirical probability distribution of
$S_H$ across different households $H$. The distribution of $S_H$
is well concentrated around 1, with more than 70% of households having
$S_H > 0.8$. This is a quantitative measure of the phenomenon
suggested by Figure 3.
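The separation measure can be computed as follows. The sketch assumes that the average total variation of a household is the mean of the total variation distances over all pairs of members, and that each rating event has already been mapped to a day index from 0 (Sunday) to 6 (Saturday); the function names and the toy data are illustrative.

```python
from collections import Counter
from itertools import combinations

DAYS = range(7)  # 0 = Sunday, ..., 6 = Saturday (assumed encoding)

def day_distribution(days):
    """Empirical distribution of a user's rating events over days of the week."""
    counts = Counter(days)
    return [counts[d] / len(days) for d in DAYS]

def total_variation(p, q):
    """Total variation distance between two distributions on the same support."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def household_separation(days_by_user):
    """Mean pairwise total variation distance among household members."""
    dists = [day_distribution(d) for d in days_by_user.values()]
    pairs = list(combinations(dists, 2))
    return sum(total_variation(p, q) for p, q in pairs) / len(pairs)

# One member rates on weekends, the other in the middle of the week.
separated = {"a": [0, 6, 6, 0], "b": [2, 3, 3, 4]}
overlapping = {"a": [0, 1], "b": [0, 1]}
print(household_separation(separated), household_separation(overlapping))
```

A value of 1 means the members never share a day of the week, and a value of 0 means their weekly rating habits are identical.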

3.2 Viewer prediction based on time-stamps 

In this section, we present three simple predictors of the
household member who watches a movie. Our third predictor
exploits the fact that the day of the week can serve as a
very good indicator of which member is watching a movie, as
suggested by Figure 3. Our predictors maximize the likelihood
that a given member rated a movie; each predictor assumes
a different model of how movie ratings take place.

The simplest model assumes that each time a movie is
watched in household $H$, the user $i \in H$ is chosen at random
with distribution $q_H(i)$, independent of everything else. This
probability can be estimated from the training set as follows
for household $H$ (we suppress the household subscript since
this is fixed to $H$ throughout):



q(i) = 



Mi>j,ti'j) e Train 



Ki'^'M^V,) e Train : i' G H}\ 



I Pi 



Given a time $t$ at which a movie is viewed, recall that
$b(t) \in \{1, \dots, T\}$ denotes the time bin. As in the previous section,
we use $T = 12$ here (one bin per month). In the second
model, the probability that the rating was given by user $i$
depends only on the time bin $b(t)$ in which it occurred, and
is independent from everything else, conditional on $b(t)$:

$$q(i \mid b(t)) = \frac{|\{(i', j, M_{i'j}, t_{i'j}) \in \text{Train} : i' = i \wedge b(t_{i'j}) = b(t)\}|}{|\{(i', j, M_{i'j}, t_{i'j}) \in \text{Train} : i' \in H \wedge b(t_{i'j}) = b(t)\}|}.$$

Figure 4: Histogram of the average total variation
distance $S_H$ across the 290 households in the training
dataset. The majority of households have an average
total variation close to 1, indicating that the
distributions of rating events by different household
members have almost disjoint supports.

Finally, let $d(t) \in W = \{\text{Sun}, \text{Mon}, \dots, \text{Sat}\}$ be the day of
the week on which the viewing occurs. Our third model
assumes that the user who rated the movie is independent
from everything else, conditional on the day of the week:

$$q(i \mid d(t)) = \frac{|\{(i', j, M_{i'j}, t_{i'j}) \in \text{Train} : i' = i \wedge d(t_{i'j}) = d(t)\}|}{|\{(i', j, M_{i'j}, t_{i'j}) \in \text{Train} : i' \in H \wedge d(t_{i'j}) = d(t)\}|}.$$

Given a tuple $(H, j, M_{Hj}, t_{Hj}) \in \text{Test}$, we can consider the
following three simple classification algorithms:

$$\arg\max_{i \in H} q(i), \qquad \arg\max_{i \in H} q(i \mid b(t_{Hj})), \qquad \arg\max_{i \in H} q(i \mid d(t_{Hj})).$$

Note that the second and third algorithms make use of the
time at which a viewing event takes place. None of the three
uses the actual rating $M_{Hj}$ given by the user. We present
an algorithm that does use the rating in the next section.
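The three frequency-based predictors differ only in the context on which the counts are conditioned, so they can share one implementation. The sketch below is an illustration on toy data: the training records are assumed to be (user, timestamp) pairs for a single household, the month is used as the time bin, and all names are hypothetical. Since the denominator of each estimate is the same for all users in a given context, comparing raw counts is enough to find the maximizer.

```python
from collections import Counter
from datetime import datetime

def fit_counts(train, context_of):
    """Count training ratings per (context, user) pair."""
    counts = Counter()
    for user, timestamp in train:
        counts[(context_of(timestamp), user)] += 1
    return counts

def predict_user(counts, household, timestamp, context_of):
    """Return the household member with the most ratings in this context."""
    context = context_of(timestamp)
    return max(household, key=lambda user: counts[(context, user)])

def no_context(timestamp):
    return None                 # unconditional model q(i)

def month_bin(timestamp):
    return timestamp.month      # model q(i | b(t)) with one bin per month

def weekday(timestamp):
    return timestamp.weekday()  # model q(i | d(t)); Monday is 0 in Python

# Toy household: user 1 rates on Saturdays, user 2 on Wednesdays.
train = [
    (1, datetime(2011, 1, 1)), (1, datetime(2011, 1, 8)),
    (2, datetime(2011, 1, 5)), (2, datetime(2011, 1, 12)),
    (2, datetime(2011, 1, 19)),
]
household = [1, 2]
query = datetime(2011, 1, 15)   # a Saturday

for context_of in (no_context, month_bin, weekday):
    counts = fit_counts(train, context_of)
    print(context_of.__name__, predict_user(counts, household, query, context_of))
```

In this toy household only the day-of-week model recovers the Saturday viewer, since the other two models fall back on whoever rated most often overall.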

3.3 Generative model 

In order to account for ratings given by the users in our
prediction, we introduce a generative model for how users
rate movies. Our model assumes that the rating given by
a user is normally distributed around the prediction made
by the low rank approximation algorithm of Section 2. In
particular, recall that the predicted rating of a user $i \in [m]$
viewing movie $j \in [n]$ at time $t$ is given by

$$\hat{M}_{ij}(b(t)) = z_i(b(t)) + \langle u_i(b(t)), v_j(b(t)) \rangle, \tag{10}$$

where $u_i, v_j \in \mathbb{R}^r$ are the vectors associated with $i$ and
$j$, respectively, and $z_i$ is the centering component. This
prediction depends on the time-stamp $t$ only through the
bin $b(t)$. Figure 5(a) shows the distribution of the residual
error

$$M_{ij} - \hat{M}_{ij}(b(t_{ij}))$$

across all user/movie pairs $(i, j)$ in the training set. The
distribution seems to be well approximated by a normal
distribution. Figure 5(b) shows the distribution of residuals for
a single user (user with ID 56094 in the training set). This
still roughly agrees with a Gaussian distribution, although
not as closely as for the overall distribution.

Figure 5: PDF of the residual error across (a) all
ratings in the training dataset and (b) all ratings
given by a single user. The distributions are well
approximated by normals.

This motivates modeling the rating given by a user $i$ for a
movie $j$ at time $t$ by a normal distribution $N(\hat{M}_{ij}(b(t)), \sigma^2)$,
where $\hat{M}_{ij}(b(t))$ is given by (10) and $\sigma^2$ is the variance of
the residual error, as estimated from the training set. More
specifically, given that a user from household $H$ views a
movie $j$ at time $t_{Hj}$, we model the joint probability that (a)
user $i \in H$ is the rater and (b) $i$ gives a rating $M$ as follows:

$$\mathbb{P}(i, M) = \frac{1}{S}\, e^{-\frac{(M - \hat{M}_{ij}(b(t_{Hj})))^2}{2\sigma^2}}\, q(i), \tag{11}$$

where $S = \sqrt{2\pi\sigma^2}$. Alternative models are obtained if we
condition on the bin or the day of the rating, as discussed
in the previous section:

$$\mathbb{P}(i, M \mid b(t_{Hj})) = \frac{1}{S}\, e^{-\frac{(M - \hat{M}_{ij}(b(t_{Hj})))^2}{2\sigma^2}}\, q(i \mid b(t_{Hj})), \tag{12}$$

$$\mathbb{P}(i, M \mid d(t_{Hj})) = \frac{1}{S}\, e^{-\frac{(M - \hat{M}_{ij}(b(t_{Hj})))^2}{2\sigma^2}}\, q(i \mid d(t_{Hj})). \tag{13}$$

Given a tuple $(H, j, M_{Hj}, t_{Hj}) \in \text{Test}$, the posterior
probability that $i \in H$ is the movie viewer under the above three
generative models can be written as:

$$\mathbb{P}(i \mid M_{Hj}, \cdot\,) = \frac{\mathbb{P}(i, M_{Hj} \mid \cdot\,)}{\sum_{i' \in H} \mathbb{P}(i', M_{Hj} \mid \cdot\,)}.$$

As a result, the following rule can be used as a classifier of
tuples $(H, j, M_{Hj}, t_{Hj}) \in \text{Test}$:

$$\arg\max_{i \in H} \mathbb{P}(i, M_{Hj} \mid \cdot\,),$$

where $\mathbb{P}(i, M_{Hj} \mid \cdot\,)$ is given for each of the three generative
models by (11), (12) and (13), respectively.
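The generative classifier combines a Gaussian likelihood for the rating with one of the priors q of the previous section. The following sketch is only an illustration with made-up numbers: for each candidate user it takes a predicted rating, a standard deviation (shared or user-dependent) and a prior value, all of which are assumed to have been computed beforehand.

```python
import math

def joint_probability(rating, predicted, sigma, prior):
    """Gaussian density of the rating around the prediction, times the prior."""
    norm = math.sqrt(2 * math.pi * sigma ** 2)
    density = math.exp(-(rating - predicted) ** 2 / (2 * sigma ** 2)) / norm
    return density * prior

def posterior(rating, candidates):
    """candidates maps user -> (predicted rating, sigma, prior)."""
    joint = {user: joint_probability(rating, *params)
             for user, params in candidates.items()}
    total = sum(joint.values())
    return {user: value / total for user, value in joint.items()}

def classify(rating, candidates):
    """Return the user with the largest posterior probability."""
    post = posterior(rating, candidates)
    return max(post, key=post.get)

# Hypothetical household: user "a" rarely rates on this day of the week.
candidates = {"a": (80.0, 10.0, 0.1), "b": (60.0, 10.0, 0.9)}
print(classify(78.0, candidates))  # the prior outweighs the closer prediction
print(classify(85.0, candidates))  # the rating is now far enough from user b's prediction
```

The example shows how the two sources of information trade off: a strong temporal prior can override a small difference in prediction error, but not a large one.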

3.4 Empirical results 

We evaluated the classification algorithms of Sections 3.2
and 3.3 by cross validation on the training and test sets,
as described in Section 2.3.1. For classifiers based on the
generative models of Section 3.3, the low-rank model was
selected to be the same as in Section 2.3.1 (in particular we
used $T = 12$, $r = 10$, $\lambda = 1$, $\xi_u = 10$, $\xi_v = \xi_z = 40$).

                      σ = ∞            σ                σ_i
q(i)                  0.3916±0.0081    0.3264±0.0102    0.3066±0.0112
q(i | b(t_Hj))        0.3626±0.0080    0.2956±0.0065    0.2777±0.0084
q(i | d(t_Hj))        0.1129±0.0066    0.1008±0.0066    0.0966±0.0072

Table 2: Misclassification rates P for algorithms of
Sections 3.2 and 3.3, with standard deviations derived
over five iterations of cross validation.



The results are summarized in Table 2 in terms of the
misclassification rate. The first column of the table ($\sigma = \infty$)
corresponds to the classifiers of Section 3.2 (not using the
ratings). The second and third columns correspond to the
classifiers outlined in Section 3.3. In the second column, the
variance $\sigma^2$ used in the normal distribution is estimated by
the empirical variance of the residual errors over all ratings
in the training set. In the third column, we used a
user-dependent variance $\sigma_i^2$ for each $i \in [m]$. This is estimated
by the variance of the residual errors of ratings given by $i$.
Finally, each row corresponds to a different assumption on
the posterior probability $q$, with the second and third rows
corresponding to the use of bin and weekday information,
respectively (cf. Eqs. (12) and (13)).

We observe that, in all cases, using the bin information
helps compared to using the unconditional probability $q(i)$,
but only marginally so. The largest improvement comes
from conditioning on the day of the week. This decreases
the misclassification rate by a factor between 3 and 4 compared
to using the unconditional probability $q(i)$. Incorporating
the generative model also decreases the misclassification
rate: classification using the generative model conditioned
on the day of the week, along with individual variances
$\sigma_i$, outperforms all other methods, with $P \approx 0.0966$.

As mentioned above, these are misclassification rates estimated
through cross-validation over five random splits. We report these in
detail because they provide a metric that is statistically more
robust. When using the original split into train and test sets
provided in the challenge, we achieve (for the third column,
$\sigma = \sigma_i$) respectively $P \approx 0.3028$ (model $q(i)$), $0.2765$ (model
$q(i \mid b(t_{Hj}))$), $0.0950$ (model $q(i \mid d(t_{Hj}))$). For this same split,
and for the model $q(i \mid d(t_{Hj}))$, the values for P2, P3 and P4
are 0.0940, 0.1051 and 0.1315 respectively.

Finally, these results remain excellent if evaluated in terms
of ROC curves, and Area Under the Curve (AUC). We compute AUC
as follows. Consider a household $H$, a user $i$, and the
corresponding probabilities $p_j = P(i \mid M_{Hj}, \cdot)$.
Let $a$ be the number of unordered pairs $(j, j')$ such that
$p_j > p_{j'}$ and $j'$ was indeed rated by $i$, while $j$ was
not. Let $b$ be the product between the number of entries in
the test set that were rated by user $i$ and the number of
entries that were not. Define $\mathrm{AUC}_{i,H} = 1 - a/b$.
$\mathrm{AUC}_{i,H}$ is the area under the ROC curve for user
$i$ versus any other user in household $H$. We estimate AUC by
averaging the above quantity over $i$ and $H$ in the test set
for which $b \neq 0$. Using the original split in test and
train set provided with the challenge dataset, we obtain
(again for the third column, $\sigma = \sigma_i$) respectively
$\mathrm{AUC} \approx 0.6170$ (model $q(i)$), 0.6619 (model
$q(i \mid b(t_{Hj}))$), 0.8947 (model $q(i \mid d(t_{Hj}))$).
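
The pairwise AUC computation described above can be sketched as follows. This is a minimal illustration, not the code used for the reported numbers; the function names `pairwise_auc` and `mean_auc` and the input layout (one score and one boolean label per test entry of a household) are assumptions made for the example.

```python
import numpy as np

def pairwise_auc(scores, rated_by_user):
    """AUC for one user within one household, as 1 - a/b.

    scores[j]        : probability p_j assigned to the user for entry j
    rated_by_user[j] : True if entry j was in fact rated by the user
    Returns None when b == 0 (the user rated all or none of the entries).
    """
    scores = np.asarray(scores, dtype=float)
    rated = np.asarray(rated_by_user, dtype=bool)
    pos = scores[rated]     # entries rated by the user
    neg = scores[~rated]    # entries rated by someone else
    b = pos.size * neg.size
    if b == 0:
        return None
    # a counts misordered pairs: an entry not rated by the user
    # receives a strictly larger probability than one that was.
    a = int(np.sum(neg[:, None] > pos[None, :]))
    return 1.0 - a / b

def mean_auc(per_user_inputs):
    """Average pairwise_auc over (scores, labels) pairs, skipping b == 0."""
    values = [pairwise_auc(s, r) for s, r in per_user_inputs]
    values = [v for v in values if v is not None]
    return float(np.mean(values)) if values else None
```

Because the comparison is strict, tied probabilities are not counted as misordered, which follows the definition of $a$ given in the text.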

4. A UNIFIED FRAMEWORK 

While the generative models studied in the previous section
yield excellent results, it is possible to improve upon them
by including further contextual information. As an example,
the rating time-stamp also tells us the time of the day at
which the rating was entered. In many households, the
separation of temporal patterns discussed in Section 3.1
becomes more acute when including the time of the day. This
calls for a systematic, scalable way of incorporating such
information.

Our approach is to formulate the problem as a supervised
multinomial classification problem. The challenge of
constructing a classifier can then be decoupled into two
separate sub-tasks: (i) constructing a generic multinomial
classifier (or choosing one from the vast literature on this
topic); (ii) constructing a suitable set of features.

In order to illustrate this approach, we describe it for a
deliberately simple classifier: $\ell_1$-regularized logistic
regression. Furthermore, we reduce the classification problem
to a binary one. Fix a household $H$, and a user $i \in H$
(omitting hereafter reference to $i$ and $H$ whenever
possible). Each rating event within household $H$ is then
characterized by the pair $(y, \Phi)$. Here $y$ is a binary
variable, equal to 1 if and only if the rating was provided by
$i$, and $\Phi$ denotes collectively the other available
information about the event. We then assume a logit model

\[
P(y = 1 \mid \Phi) = \frac{e^{\langle \theta, x(\Phi) \rangle}}{1 + e^{\langle \theta, x(\Phi) \rangle}} \tag{14}
\]

whereby $x(\Phi) \in \mathbb{R}^p$ is a feature vector
constructed from the available information, and
$\theta = \theta_{i,H} \in \mathbb{R}^p$ is a vector of
parameters to be fitted from the data. Assuming the parameters
are known, a rating will be attributed to user $i$ if this
maximizes the probability (14) among all the users in the same
household.
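
As a rough illustration of this attribution rule, the sketch below evaluates the logit probability (14) for every user of a household and picks the user with the largest value. The names `logit_probability` and `attribute_rating`, and the dictionary holding one already-fitted parameter vector per user, are hypothetical and chosen only for the example.

```python
import numpy as np

def logit_probability(theta, x):
    """P(y = 1 | Phi) under the logit model, where x = x(Phi)."""
    z = float(np.dot(theta, x))
    # 0.5 * (1 + tanh(z / 2)) equals e^z / (1 + e^z) and avoids overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def attribute_rating(thetas_by_user, x):
    """Attribute a rating event to the household member with the
    highest logit probability.

    thetas_by_user : dict mapping user id -> fitted parameter vector
    x              : feature vector of the rating event
    """
    probs = {u: logit_probability(th, x) for u, th in thetas_by_user.items()}
    best_user = max(probs, key=probs.get)
    return best_user, probs
```

Note that the per-user probabilities come from separate binary models, so they need not sum to one across the household; only their ordering matters for the attribution.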

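In order to learn the parameters $\theta = \theta_{i,H}$, we
consider the training rating events within household $H$, and
index them by $s \in \{1, \dots, N_H\}$. Denoting the $s$-th
such event by $(y_s, \Phi_s)$, we consider the regularized
likelihood

\[
\mathcal{L}(\theta) = \sum_{s=1}^{N_H} \Big\{ -y_s \langle \theta, x(\Phi_s) \rangle + \log\big(1 + e^{\langle \theta, x(\Phi_s) \rangle}\big) \Big\} + \lambda_1 \|\theta\|_1 .
\]
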

Once again we emphasize that regularized logistic regression 
is not necessarily the best classification method, and our 
approach accommodates alternative algorithms. 
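
To make the objective concrete, the following sketch minimizes an $\ell_1$-regularized logistic loss of this form with proximal gradient descent (ISTA). This is a stand-in chosen for brevity and is not an interior-point method; the function names, the absence of an intercept term, and the fixed iteration count are assumptions of the example.

```python
import numpy as np

def objective(theta, X, y, lam):
    """Negative log-likelihood of the logit model plus lam * ||theta||_1.

    X has one row x(Phi_s) per training event; y holds the labels y_s.
    """
    z = X @ theta
    return float(np.sum(np.logaddexp(0.0, z) - y * z)
                 + lam * np.sum(np.abs(theta)))

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fit_l1_logistic(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) with a constant step size."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # The gradient of the smooth part is Lipschitz with constant
    # ||X||_2^2 / 4, so a step of 1/L guarantees descent.
    lipschitz = 0.25 * np.linalg.norm(X, 2) ** 2
    step = 1.0 / lipschitz if lipschitz > 0 else 1.0
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 0.5 * (1.0 + np.tanh(0.5 * (X @ theta)))  # sigmoid
        grad = X.T @ (p - y)
        theta = soft_threshold(theta - step * grad, step * lam)
    return theta
```

With a sufficiently large regularization weight every coordinate is thresholded to zero, which is the sparsity-inducing behaviour one expects from the $\ell_1$ penalty.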

We implemented this procedure using l1_logreg, a software
package that minimizes $\mathcal{L}(\theta)$ based on an
interior point method described in [10]. All the data was
standardized before being passed to the solver. The algorithm
was tested for different feature vectors constructed by
including at most the following:

(a) The day of the week of the rating, implemented as a
length-7 binary indicator vector.

(b) The hour of the day of the rating, implemented as a
length-24 binary indicator vector.

(c) The movie feature vector $v_j(b(t_{ij})) \in \mathbb{R}^r$,
learned from the low-rank model of Section 2.2 (see the
footnote below).

(d) The time bin $b(t_{Hj})$, implemented as a length-12 binary
indicator vector.

(e) The actual rating $M_{ij} \in \{0, \dots, 100\}$, scaled and
shifted so that 0 corresponds to 1 and 100 to 5.

Table 3 shows how we reach our best values for $P$ as we
include more and more features in the feature vector.
Although including more features seems to help under
cross-validation, on the challenge test set omitting the
rating produces the best results. We note, however, that the
way we use the regularized logistic regression can easily be
improved by assigning different regularization weights to
different components of the feature vector (at present we use
the same weight, $\lambda_1$, for all of them). This might
explain why including certain features does not improve the
results.

Features        | 5-fold cross-validation | Challenge test set
(a)             | 0.1137 ± 0.0077         | 0.1142
(a), (b)        | 0.0483 ± 0.0039         | 0.0570
(a), (b), (c)   | 0.0468 ± 0.0032         | 0.0463
(a), ..., (d)   | 0.0423 ± 0.0020         | 0.0406 *
(a), ..., (e)   | 0.0419 ± 0.0026         | 0.0412

Table 3: Misclassification rates $P$ using the regularized
logistic regression for $\lambda_1 = 0.01$ and sequentially
including more features into the feature vector. The
performance of our best predictor on the challenge test set
is marked with an asterisk.

With this choice of $x(\Phi)$, and $\lambda_1 = 0.01$, we
achieved a misclassification rate $P = 0.0419 \pm 0.0026$ and
area under the curve $\mathrm{AUC} = 0.9689 \pm 0.0027$, as
estimated through the subsampling procedure described above.
On the challenge test set, and not including the ratings in
$x(\Phi)$, the same performance metrics evaluated to
$P = 0.0406$ and $\mathrm{AUC} = 0.9611$.

For the challenge test set the values of $P_2$, $P_3$ and
$P_4$ are 0.0413, 0.0268 and 0.0463 respectively. We note that
the misclassification rate is smallest for households with 3
users. This is contrary to the natural intuition that the more
people belong to a household, the harder it should be to
distinguish between them.

Footnote to item (c): The misclassification rate
$P \approx 0.37$ obtained using the low-rank model in Section
2.3.1 can be lowered to 0.30 by binning the time-stamps into 7
different bins, one per day of the week. This suggests
adopting a 7-bin model of vectors on a per week-day (rather
than per month) basis. However, adopting a 7-bin model did not
improve the performance of the other classification
algorithms introduced in the paper, which rely on and
outperform the low-rank model. This is also the case when, in
the unified framework described in this section, we include in
$x(\Phi)$ the vector $v_j(d(t_{Hj}))$ instead of
$v_j(b(t_{Hj}))$.
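
The feature vector built from items (a) to (e) can be assembled roughly as in the sketch below. It is only an illustration under stated assumptions: the function name `build_features`, the convention that the time bin is passed in as an integer between 0 and 11, and the use of Python's Monday-first weekday numbering are all choices made for the example, and the standardization step applied before fitting is omitted.

```python
from datetime import datetime
import numpy as np

def one_hot(index, length):
    """Binary indicator vector with a single 1 at position `index`."""
    v = np.zeros(length)
    v[index] = 1.0
    return v

def build_features(timestamp, movie_vec, time_bin, rating=None):
    """Concatenate features (a)-(d), and (e) when a rating is given.

    timestamp : datetime of the rating event
    movie_vec : movie feature vector from the low-rank model
    time_bin  : integer in 0..11 (assumed encoding of the 12 time bins)
    rating    : raw rating in [0, 100], or None to leave out feature (e)
    """
    parts = [
        one_hot(timestamp.weekday(), 7),      # (a) day of the week
        one_hot(timestamp.hour, 24),          # (b) hour of the day
        np.asarray(movie_vec, dtype=float),   # (c) movie feature vector
        one_hot(time_bin, 12),                # (d) time bin
    ]
    if rating is not None:
        # (e) map 0 -> 1 and 100 -> 5 linearly
        parts.append(np.array([1.0 + 4.0 * rating / 100.0]))
    return np.concatenate(parts)
```

Leaving `rating` as `None` corresponds to the feature set (a) to (d), which gave the best result on the challenge test set in Table 3.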



